1.  INTRODUCTION

DBC  is  a database  computer  for  very  large  databases  [1,2].  A major  function 
of  DBC  is  to  carry  out  rapid  content-search  cost-effectively,  since  a database 
system  (which  can  be  supported  on  DBC)  makes  use  of  this  operation  quite 
heavily.  By  using  large  mass  memory  (MM)  blocks  (disk  cylinders)  and  by 
incorporating tracks-in-parallel readout mechanisms, every one of the blocks
can be searched individually by means of a set of parallel microprocessors.
Consequently, large amounts of data can be examined rapidly. By making use of
the  clustering  mechanism  of  DBC's  controller  (DBCCP)  and  of  the  directories 
stored  in  the  structure  memory  (SM) , it  is  ensured  that  only  a few  blocks 
need  be  searched  in  order  to  answer  a query  thereby  achieving  higher  access 
precision  in  response  to  user  requests. 

The  order  in  which  data  are  presented  to  the  user  is  often  of  some  concern. 

In  conventional  machines,  the  process  of  ordering  records,  if  the  ordering  is 
to  be  based  on  data  item  values,  is  rather  slow.  However,  the  speed  of  this 
process is generally compatible with the speed at which records are searched,
which is also slow. The use of DBC greatly speeds up the search process. In
order  to  maintain  comparable  ordering  speed  so  that  the  record-ordering  process 
does  not  cause  a bottleneck,  it  is  necessary  that  DBC  be  provided  with  a 
fast  hardware  sorter. 

The post processor (PP) of DBC has the capability of sorting records, as
well  as  carrying  out  other  post-processing  functions,  such  as  computing  set 
functions  (max,  min,  average,  count,  sum)  and  performing  the  equality-join 
operation.  Once  the  response  set  of  a user  query  has  been  identified  by  the 
mass  memory,  this  set  of  records  can  be  sorted  by  the  post  processor.  Instead 
of  sorting  all  records  retrieved  by  the  mass  memory,  only  those  that  are 
meaningful to the user are actually sorted. Sorting is, therefore, incorporated
as a post-processing function rather than as a function of the mass memory.
Thus, sorting of records that have been retrieved in response to a particular
query can be performed in conjunction with the evaluation of another query by
the mass memory. Furthermore, the mass memory remains simple, since it is
not involved with the ordering of retrieved records.

2.  BRIEF  REVIEW  OF  PARALLEL  SORTING 

Certain sorting methods (such as distributive partitioning [3]) are available
which use a single processor to sort N elements in an expected time of O(N).
Even though these methods are as fast as any sorting method can be, they rely
on the use of random access memory and an amount of memory which is O(log N),
instead of O(N). (All logarithms, as used in this report, are with respect to
the base of 2.) Furthermore, depending on the distribution of values of the
sort attribute, the worst case time for sorting is O(N log N).

Sorting  methods  that  do  not  rely  on  the  use  of  random  access  memory  are 
also available. Some of these methods make use of a single processor and a
linear amount of sequential memory, i.e., O(N) memory, to sort N elements in
O(N log N) time. One of these methods is an adaptation of the merge-sort
method  of  Knuth  [4,  pp.  159-168]. 


With  the  use  of  P parallel  processors  and  a linear  amount  of  sequential 
memory, our aim is to achieve a sorting time of O((N log N)/P) for sorting N
elements. This is because a single processor can do the task in O(N log N)
time, and it may be expected that P processors should do the task P times
faster. Unfortunately, there is no known way to partition the problem into
P parts so that each of the P processors can work on a single part and execute
it in O((N log N)/P) time.

There exist parallel sorting methods (such as the rebound sorter [5])
which use P relatively simple processors (comparators) to sort P elements in
O(P) time. There also exist parallel sorting methods (such as the ones in [6] and
[7]) which use P processors to sort P elements in O(sqrt(P) + log^2 P) time. The
latter methods use an Illiac IV type of two-dimensional structure of processors.
Another method by Stone [8] uses P processors to sort P elements in O(log^2 P)
time. Even though these methods are quite fast, they suffer from the fact that
groups of records larger than P in number will have to be sorted in separate
batches of P records and then merged. Other methods reported in the literature
also have the same problem.

3.  OUR  OBJECTIVE 

We shall propose a method in which P processors can sort N elements
(where N >= P) in O((N/P) log(N/P) + (N/P) log^2 P) time. In other words, P
processors will sort P elements in O(log^2 P) time. Besides, more than P records
can be sorted as a single batch. In fact, if each processor has enough memory
to hold M records, then as many as MP records can be sorted as a single batch.
Also note that when N >> P such that log N >> log^2 P, then the sorting time
approaches O((N/P) log(N/P)) which is, of course, the best that can be expected
from P processors. The method uses P processors, each with some sequential
memory and each processor connected directly to log P other processors (say,
by using routing registers as in Illiac IV).

4.  THE  BITONIC  SORT 

Our  sorting  method  is  a variation  of  the  bitonic  sorting  method  originally 
proposed by Batcher [9]. Batcher's bitonic sorting method is based on the concept
of bitonic sequence. A sequence S = (s1, s2, ..., sN) is bitonic if there is an
index k, where 1 <= k <= N, such that either

(i)   s1 <= s2 <= ... <= sk >= s(k+1) >= ... >= sN    or

(ii)  s1 >= s2 >= ... >= sk <= s(k+1) <= ... <= sN

If a sequence S = (s1, s2, ..., sN) is bitonic then so are the sequences
S'odd = (s1, s3, s5, ...) and S'even = (s2, s4, s6, ...). It can be shown [9]
that to sort a bitonic sequence S = (s1, s2, ..., sN), it is only necessary to
recursively sort the two bitonic subsequences S'odd and S'even separately and
then merge the two resultant sequences by simply carrying out an element by
element comparison. If S'odd is sorted to produce (x1, x3, x5, ...) and S'even
is sorted to produce (x2, x4, x6, ...), then the sorted version of the original
sequence S is simply

(min(x1,x2), max(x1,x2), min(x3,x4), max(x3,x4), ...).
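The odd/even recursion just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name batcher_sort_bitonic is hypothetical, the input is assumed to be a bitonic list whose length is a power of 2, and Python is used purely for exposition (the report itself gives no program for this step).

```python
def batcher_sort_bitonic(s):
    """Sort a bitonic list s (length assumed to be a power of 2)
    by recursively sorting S'odd and S'even and pairing the results."""
    if len(s) <= 1:
        return list(s)
    odd = batcher_sort_bitonic(s[0::2])   # s1, s3, s5, ... (1-based numbering)
    even = batcher_sort_bitonic(s[1::2])  # s2, s4, s6, ...
    out = []
    for x_odd, x_even in zip(odd, even):
        # element by element comparison of the two sorted subsequences
        out.append(min(x_odd, x_even))
        out.append(max(x_odd, x_even))
    return out


if __name__ == "__main__":
    print(batcher_sort_bitonic([1, 4, 6, 7, 5, 3, 2, 0]))
```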


4.1 An Improvement

Our method of sorting, as we have already mentioned, is a variation of
Batcher's bitonic sorting method. This method sorts N elements, where N is
a power of 2. The method is based on the fact that if a sequence
S = (s1, s2, ..., sN) is bitonic, then so are the sequences

S'small = (min(s1, s[N/2+1]), min(s2, s[N/2+2]), ..., min(s[N/2], sN)) and
S'large = (max(s1, s[N/2+1]), max(s2, s[N/2+2]), ..., max(s[N/2], sN))

Furthermore, every element of S'small is no larger than any element of S'large.

Thus, it is only necessary to sort the bitonic subsequences S'small and S'large
separately and then concatenate them (instead of merging them) in order to get
the sorted version of the given sequence S. In other words, to sort a bitonic
sequence S = (s1, s2, ..., sN), we compare the two halves of the sequence, element
by element, put the smaller elements in one subsequence and the larger ones in
another, sort these two bitonic subsequences, and finally concatenate the two
sorted subsequences. This idea may be used in sorting fixed numbers of records
in a mesh-connected computer [6]. But we propose to use this improved variation
of the bitonic sorting method in order to parallel sort arbitrary numbers of
records with P processors having log P interconnections.
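As a minimal sequential sketch of the compare-halves-and-concatenate idea, the following Python function splits a bitonic list into S'small and S'large and recurses. The name sort_bitonic is hypothetical, and the input length is assumed to be a power of 2 as stated in the text.

```python
def sort_bitonic(s):
    """Sort a bitonic list s (length assumed to be a power of 2) by
    comparing its two halves element by element, then recursing."""
    n = len(s)
    if n <= 1:
        return list(s)
    half = n // 2
    small = [min(s[k], s[k + half]) for k in range(half)]  # S'small
    large = [max(s[k], s[k + half]) for k in range(half)]  # S'large
    # every element of small is no larger than any element of large,
    # so concatenation (not merging) gives the sorted sequence
    return sort_bitonic(small) + sort_bitonic(large)


if __name__ == "__main__":
    print(sort_bitonic([1, 4, 6, 7, 5, 3, 2, 0]))
```

Unlike the odd/even form, no final pairing pass is needed after the recursive calls, which is what makes the scheme convenient when the two halves live in different processors.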
4.2 Proof of Correctness of the Improved Bitonic Sort

To prove correctness of the improved bitonic sorting method we first
present a definition and a lemma:

Definition: Given an ordered pair of arbitrary numbers <n1, n2>, a
compare-and-interchange operation compares the numbers n1 and n2 and
permutes them into an ordered pair

(i)   <n1, n2>   if n1 <= n2

(ii)  <n2, n1>   if n1 > n2.

Lemma: Suppose there is an algorithm that sorts any arbitrary sequence of 0s
and 1s into a non-decreasing order by simply performing compare-and-interchange
operations. Then the algorithm will sort any arbitrary sequence of arbitrary
numbers into a non-decreasing order. The proof of the lemma (stated in terms
of sorting networks) appears in [4, p. 224].

Making use of the above lemma, we have only to show that, given a bitonic
sequence S = (s1, s2, ..., sN) of 0s and 1s, the two subsequences

S'small = (min(s1, s[N/2+1]), min(s2, s[N/2+2]), ..., min(s[N/2], sN)) and
S'large = (max(s1, s[N/2+1]), max(s2, s[N/2+2]), ..., max(s[N/2], sN))

are bitonic, and that every element of S'small is no larger than any element of
S'large.

The proof considers two cases: (i) the given sequence S consists of a
sequence of 1s, followed by a sequence of 0s, followed by a sequence of 1s;
(ii) the given sequence S consists of a sequence of 0s, followed by a sequence
of 1s, followed by a sequence of 0s. The two cases are proved graphically and
shown separately in Figure 1. In the figure, a sequence of x 1s or x 0s is
represented as 1^x or 0^x, respectively. For each of the two cases, there are
four subcases depending on the position of the mid-point of the given sequence
S. Each subcase corresponds to a row in the figure. The sequence S in each
subcase is diagrammed in the rightmost column of the figure, where 1s are represented
at a higher level than 0s. We notice that, in every possible subcase, S'small
and S'large are bitonic and that every element of S'small is no larger than
any element of S'large.
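Since Figure 1 is not reproduced here, the case analysis can also be checked mechanically for small N. The Python sketch below enumerates every sequence of the two forms considered in the proof and tests both claims; all function names are hypothetical, and the check is a finite test for the chosen lengths rather than a substitute for the proof.

```python
def is_bitonic(s):
    """True if s rises then falls, or falls then rises (the two forms
    in the definition of a bitonic sequence)."""
    def up_then_down(t):
        k = 0
        while k + 1 < len(t) and t[k] <= t[k + 1]:
            k += 1
        while k + 1 < len(t) and t[k] >= t[k + 1]:
            k += 1
        return k == len(t) - 1
    return up_then_down(s) or up_then_down([-x for x in s])


def half_split(s):
    """Return (S'small, S'large) for a sequence of even length."""
    half = len(s) // 2
    small = [min(s[k], s[k + half]) for k in range(half)]
    large = [max(s[k], s[k + half]) for k in range(half)]
    return small, large


def bitonic_01_sequences(n):
    """All sequences 1^a 0^b 1^c and 0^a 1^b 0^c with a + b + c = n."""
    for a in range(n + 1):
        for b in range(n + 1 - a):
            c = n - a - b
            yield [1] * a + [0] * b + [1] * c
            yield [0] * a + [1] * b + [0] * c


def lemma_holds(n):
    """Check both claims for every bitonic 0/1 sequence of even length n."""
    for s in bitonic_01_sequences(n):
        small, large = half_split(s)
        if not (is_bitonic(small) and is_bitonic(large)):
            return False
        if max(small) > min(large):
            return False
    return True


if __name__ == "__main__":
    print(all(lemma_holds(n) for n in (2, 4, 8, 16)))
```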

4.3 The Parallel Sorting Algorithm

In our algorithm using P parallel processors, we assume that each of the
P processors has enough memory to accommodate M records. Altogether there are
MP records. P is a power of 2, but M can be any positive integer, not necessarily
a power of 2. The processors are numbered 0 through (P-1) and the M records
in processor i, for 0 <= i <= P-1, are named R[i,1], R[i,2], ..., R[i,M],
respectively. Figure 2 depicts the record indexing scheme.

4.3.1  Illustration  - Example  with  Actual  Records 

The  manner  in  which  the  P parallel  processors  sort  MP  records  is 
illustrated  by  means  of  an  example.  We  have  P=4  and  M=5.  The  initial 
configuration  of  records  (represented  by  sort  values  alone)  is  shown  in  the 
left  hand  side  of  Figure  3.  (1  + log  P)  steps  are  involved  in  the  sorting 

process.  Step  0 involves  one  pass  over  all  the  records,  Step  1 involves  two 
passes,  etc. 

Step  0 (as  depicted  in  Figure  3) 

Pass 1. Each processor sorts its own records. The odd-numbered processors
sort in non-decreasing order, the even-numbered processors sort in
non-increasing order.

Step  1 (as  depicted  in  Figure  4) 

In this step, there are 2 bitonic sequences in the two pairs of processors
(0,1) and (2,3), respectively. We are to sort each of these bitonic sequences.

Pass 1. Processors 0 and 1 compare records (R[0,1] with R[1,1], R[0,2] with
R[1,2], etc.); the smaller records (i.e., those having the smaller
sort values) are placed in processor 0 and the larger ones in
processor 1. Processors 3 and 2 (in this order) do the same.

Pass 2. Each processor sorts its own records. Processors 0 and 1 sort in
non-decreasing order while processors 2 and 3 sort in non-increasing
order.

Processor
Numbers      Records

0            R[0,1]    R[0,2]    ...  R[0,M]
1            R[1,1]    R[1,2]    ...  R[1,M]
2            R[2,1]    R[2,2]    ...  R[2,M]
...
P-1          R[P-1,1]  R[P-1,2]  ...  R[P-1,M]

FIGURE 2.  Record Indexing Scheme When There
Are P Processors and M Records
Per Processor

[FIGURE 3 note: e.g., R[2,4] is 15; more accurately, the sort value
of the record R[2,4] is 15.]

[FIGURE 4.  Example (figure not reproduced)]

Step 2 (as depicted in Figure 5)

In this step there is 1 bitonic sequence in the quadruple of processors
(0,1,2,3). We are to sort this bitonic sequence.

Pass 1. Processors 0 and 2 compare records (R[0,1] with R[2,1], R[0,2] with
R[2,2], etc.); the smaller records are placed in processor 0 and
the larger ones in processor 2. Processors 1 and 3 (in this order)
do the same.

Pass 2. Processors 0 and 1 compare records; the smaller records are placed in
processor 0 and the larger ones in processor 1. Processors 2 and 3
do the same.

Pass  3.  Each  processor  sorts  its  own  records  in  non-decreasing  order. 

The  final  configuration  of  records  is  as  shown  in  Figure  5. 
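The passes walked through above can be replayed with a small sequential simulation. The sketch below is an assumption-laden illustration, not the post processor's implementation: the function name parallel_bitonic_sort is hypothetical, the record values in the demo are made up (those of Figure 3 are not available here), the sort directions follow the Direction rule of Section 4.3.3 (bit Step of the processor number), and Python's built-in sort stands in for the localized sort.

```python
import random


def parallel_bitonic_sort(recs):
    """recs[i] lists the sort values held by processor i.
    len(recs) is assumed to be a power of 2; every processor holds M values."""
    P = len(recs)
    log_p = P.bit_length() - 1
    assert 1 << log_p == P, "P must be a power of 2"
    recs = [list(r) for r in recs]          # work on a copy

    def localized_sort(step):
        for i in range(P):
            direction = (i >> step) & 1     # (Step+1)-th lowest bit of i
            recs[i].sort(reverse=(direction == 1))

    def merge(diff, step):
        for i in range(P):
            if i % (2 * diff) < diff:       # i is the lower partner of (i, j)
                j = i + diff
                direction = (i >> step) & 1
                lo, hi = (i, j) if direction == 0 else (j, i)
                for k in range(len(recs[i])):
                    # record by record comparison; smaller goes to lo
                    if recs[lo][k] > recs[hi][k]:
                        recs[lo][k], recs[hi][k] = recs[hi][k], recs[lo][k]

    localized_sort(0)                        # Step 0
    for step in range(1, log_p + 1):         # Steps 1 .. log P
        diff = 1 << step
        while diff > 1:
            diff //= 2
            merge(diff, step)
        localized_sort(step)
    return recs


if __name__ == "__main__":
    random.seed(1)
    demo = [[random.randint(0, 99) for _ in range(5)] for _ in range(4)]  # P=4, M=5
    for i, block in enumerate(parallel_bitonic_sort(demo)):
        print(i, block)
```

With P = 4 the simulation performs exactly the passes listed above: one pass in Step 0, two in Step 1 and three in Step 2.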


4.3.2  Illustration  of  the  Steps  in  the  Algorithm 

The algorithm for parallel sorting, using the improved bitonic sorting
method, is illustrated in Figure 6. There are 16 processors, so that (1 + log 16)
or 5 steps are involved. In each step there can be several passes over the
records, the last of which involves localized sorting (where each processor
sorts its own records). Localized sorting is indicated by a horizontal arrow,
directed right for non-decreasing order and directed left for non-increasing
order. A curved arrow from processor i to processor j (directed toward
processor j) indicates that i and j must perform record by record comparisons,
store the smaller ones in i and the larger ones in j.

4.3.3  The  Algorithm 

In all the procedures that constitute the parallel sorting algorithm,
P is the number of processors, M is the number of records per processor and
Step is a variable for the step number. A proc-set, defined with respect to
any  particular  pass  in  some  step  of  the  algorithm,  is  a group  of  consecutive 
processors  such  that  any  two  proc-sets  have  the  same  size  (which  is  the  number 
of  processors  in  the  proc-set)  and  all  processors  in  a proc-set  sort  in  the 
same  direction  (either  all  of  them  sort  in  a non-decreasing  order  or  all  of 
them  sort  in  a non-increasing  order).  Each  proc-set  has  two  processors 
during  the  second-to-last  pass  in  any  step,  each  proc-set  has  four  processors 
during  the  third-to-last  pass  in  any  step,  each  proc-set  has  eight  processors 
during  the  fourth-to-last  pass  in  any  step,  etc.  The  size  of  a proc-set 
is  always  a power  of  2.  The  variable  Diff  (meaning,  difference)  indicates 
that  processor  i must  interact  (during  some  pass  in  some  step)  with  processor 
j,  where  j is  either  (i  + Diff)  or  (i-  Diff).  The  logical  variable 
Direction  indicates  the  following: 

(i)  When  localized  sorting  is  necessary,  it  must  be  done  in  non- 
decreasing order  if  Direction  = 0 and  in  non-increasing  order  if 
Direction  = 1. 

(ii)  When  processor  i is  to  interact  with  processor  j,  those  records 
with  smaller  sort  values  must  be  stored  in  i and  those  with 
bigger ones in j if Direction = 0, and the roles of i and j are
reversed  if  Direction  = 1. 

A.  Procedure Executed by the Post Processing Controller

    Step := 0
    Broadcast ('localized sort', Step)
    Step := 1
    while Step <= log P
    do  Diff := 2 ** Step
        while Diff > 1
        do  Diff := Diff/2
            Broadcast ('merge', Diff, Step)
        end
        Broadcast ('localized sort', Step)
        Step := Step + 1
    end
The post processing controller coordinates the execution of the sorting
algorithm by broadcasting, during every pass, a command ('merge' or 'localized
sort') and the arguments (Diff or Step or both) to all the P processors. All
the P processors can work in parallel to execute each command broadcast by
the controller.
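To make the order of broadcasts concrete, the controller loop can be transcribed as a function that returns the list of commands instead of sending them. The name controller_schedule and the tuple encoding of commands are assumptions made for this sketch only.

```python
def controller_schedule(P):
    """Return the commands the controller would broadcast, in order.
    P is assumed to be a power of 2."""
    log_p = P.bit_length() - 1
    commands = [("localized sort", 0)]                 # Step 0
    for step in range(1, log_p + 1):
        diff = 2 ** step
        while diff > 1:
            diff //= 2
            commands.append(("merge", diff, step))     # one pass per Diff
        commands.append(("localized sort", step))      # last pass of the step
    return commands


if __name__ == "__main__":
    for command in controller_schedule(4):
        print(command)
```

For P = 4 this yields one pass in Step 0, two passes in Step 1 and three passes in Step 2, matching the example of Section 4.3.1.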


B.  Procedures  Executed  by  a Processor  (Processor  i) 

Processor i executes a 'merge' command (which has the arguments Step and
Diff) as follows:

    Proc-set-size := 2 * Diff
    A := i mod Proc-set-size   /* Processor i is the A-th one in its proc-set */
    Direction := (Step+1)-th lowest significant bit of the number i (where the
                 absolute lowest significant bit is the 1st or first lowest
                 significant bit)
    if A < Diff then
        do  j := i + Diff
            if Direction = 0 then Send(j) else Receive(j)
        end
    else
        do  j := i - Diff
            if Direction = 0 then Receive(j) else Send(j)
        end
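The partner and role selection of the 'merge' command can be sketched as a pure function. The name merge_role and the string return values are hypothetical; the arithmetic follows the procedure above, with the (Step+1)-th lowest significant bit obtained by shifting i right by Step positions.

```python
def merge_role(i, diff, step):
    """Return (j, role) for processor i on a 'merge' command, where j is the
    partner processor and role is 'send' or 'receive'."""
    proc_set_size = 2 * diff
    a = i % proc_set_size              # position of i within its proc-set
    direction = (i >> step) & 1        # (Step+1)-th lowest significant bit
    if a < diff:
        j = i + diff
        role = "send" if direction == 0 else "receive"
    else:
        j = i - diff
        role = "receive" if direction == 0 else "send"
    return j, role


if __name__ == "__main__":
    # Step 1, Diff 1 with P = 4: the pairs (0,1) and (3,2) of Section 4.3.1
    for i in range(4):
        print(i, merge_role(i, 1, 1))
```

The sender is always the processor that ends up with the smaller records, which is why processor 3 (not 2) sends in Step 1.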

The procedures Send and Receive are defined as follows:

    Procedure Send(j)   /* as executed by processor i */
        Count := 1      /* count of records in processor i or j */
        while Count <= M
        do  send the next record R[i,Count] to processor j
            wait for the return record from j
            call this record R[i,Count]
            Count := Count + 1
        end
    end procedure Send

    Procedure Receive(j)   /* as executed by processor i */
        Count := 1
        while Count <= M
        do  wait for the next record R[j,Count] from processor j
            compare this record with own next record R[i,Count]
            if the sort value of R[i,Count] is smaller, then
                interchange R[i,Count] with R[j,Count]
            send R[j,Count] back to processor j
            Count := Count + 1
        end
    end procedure Receive
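The joint effect of one Send and one matching Receive can be sketched sequentially, with two Python lists standing in for the record stores of the two processors. The function name send_receive is hypothetical and the message passing is collapsed into ordinary assignments, so this shows only the data movement, not the synchronization.

```python
def send_receive(sender, receiver):
    """Simulate Send on one processor and Receive on its partner.
    Both lists hold M sort values and are updated in place: the sender
    ends up with the smaller value of each pair, the receiver with the larger."""
    for count in range(len(sender)):
        incoming = sender[count]              # record routed to the receiver
        if receiver[count] < incoming:        # receiver's own record is smaller
            receiver[count], incoming = incoming, receiver[count]
        sender[count] = incoming              # record routed back to the sender


if __name__ == "__main__":
    s, r = [1, 3, 9], [8, 2, 4]
    send_receive(s, r)
    print(s, r)
```

Each position costs two routing operations (one in each direction) and one comparison, which is the count used later in the timing analysis.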


Whenever processors i and j need to merge records, then the one which
ought to keep the smaller records executes its Send procedure and the other
processor executes its Receive procedure. If processor i ought to keep the
smaller records, then it sends its records, one at a time, to j; processor
j then keeps the bigger records (after record by record comparison) and
sends the smaller ones back to processor i.

Finally, any processor i executes a 'localized sort' command issued
by the controller with the argument Step, by doing the following:

    Direction := (Step+1)-th lowest significant bit of number i
    if Direction = 0 then sort all the M local records in
        non-decreasing order, else sort in non-increasing order

Localized sorting can be done in the 0-th step by using a merge sort method
[4, pages 159-168]. Localized sorting during any other step involves sorting
a bitonic sequence, and can be done by simply merging records starting at the
two ends.
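Merging from the two ends can be sketched as follows. The sketch assumes the block has exactly one of the two shapes in the definition of Section 4 (rising then falling, or falling then rising); a block that is only a rotation of such a shape is outside what it handles. The function name sort_bitonic_two_ended is hypothetical, and the result is always returned in non-decreasing order.

```python
def sort_bitonic_two_ended(s):
    """Sort a bitonic list in one pass by merging inward from both ends.
    Assumes s rises then falls, or falls then rises."""
    # The first strict change tells which shape we have.
    rising_first = True
    for a, b in zip(s, s[1:]):
        if a != b:
            rising_first = a < b
            break
    i, j = 0, len(s) - 1
    out = []
    while i <= j:
        if rising_first:
            # both ends are minima of their runs: take the smaller end
            take_left = s[i] <= s[j]
        else:
            # both ends are maxima of their runs: take the larger end
            take_left = s[i] >= s[j]
        if take_left:
            out.append(s[i])
            i += 1
        else:
            out.append(s[j])
            j -= 1
    # the falling-then-rising case was collected in non-increasing order
    return out if rising_first else out[::-1]


if __name__ == "__main__":
    print(sort_bitonic_two_ended([1, 4, 6, 7, 5, 3, 2, 0]))
    print(sort_bitonic_two_ended([7, 4, 5, 6]))
```

The pass makes M - 1 comparisons for a block of M records, in line with the M comparison operations per localized sort assumed in the timing analysis.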

5.  INTERCONNECTION OF PROCESSORS

It may be noted that this parallel sorting algorithm only requires a
processor to interact (i.e., compare records) with exactly log P other
processors, where P is the total number of processors. By providing direct
interconnection of each processor with log P others, we can ensure that
routing a record from one processor to another requires exactly one routing
step (since the record does not have to go through any intermediate processor).
For example, if P = 8, we have the following interconnections:

Processor 0 is connected to Processors 1,2,4
Processor 1 is connected to Processors 0,3,5
Processor 2 is connected to Processors 0,3,6
Processor 3 is connected to Processors 1,2,7
Processor 4 is connected to Processors 0,5,6
Processor 5 is connected to Processors 1,4,7
Processor 6 is connected to Processors 2,4,7
Processor 7 is connected to Processors 3,5,6

The algorithm for finding the interconnections for any processor i during the
time the post processor is initially designed is given as follows:

    Proc-set-size := 2
    while Proc-set-size <= P       /* P is the total number of processors */
    do  A := i mod Proc-set-size   /* Processor i is the A-th processor in
                                      its proc-set */
        Diff := Proc-set-size/2
        if A < Diff then j := i + Diff else j := i - Diff
        connect processor i to processor j
        Proc-set-size := Proc-set-size * 2
    end
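The interconnection procedure translates directly into Python. The function name interconnections is hypothetical; the loop is run while Proc-set-size <= P so that each processor obtains log P partners, as in the P = 8 table above.

```python
def interconnections(i, P):
    """Return the processors directly connected to processor i.
    P is assumed to be a power of 2."""
    partners = []
    proc_set_size = 2
    while proc_set_size <= P:
        a = i % proc_set_size              # position of i in its proc-set
        diff = proc_set_size // 2
        j = i + diff if a < diff else i - diff
        partners.append(j)
        proc_set_size *= 2
    return partners


if __name__ == "__main__":
    for i in range(8):
        print(i, sorted(interconnections(i, 8)))
```

Each partner differs from i in exactly one bit of its binary number, so the procedure produces the same links as flipping one bit of i at a time.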


The layout of processors and their interconnections are shown in Figure 7,
for various values of P, the total number of processors. The actual arrangement
(or positions) of P processors in a loop can be defined recursively as
follows:

1)  For P = 4, the arrangement of processors (in a loop) is as shown
    for this particular case in Figure 7.

2)  For P = 2^(n+1), where n > 1, start with the arrangement of processors
    for P = 2^n. Call this arrangement T1. From T1, create another loop
    of processors T2 by simply renumbering every processor i (of T1)
    as i + 2^n. Rotate T2 180 degrees about the horizontal axis. Place
    T2 underneath T1. Split T1 at the lowest point where the vertical
    axis intersects the loop (i.e., T1). Split T2 similarly but at the
    highest point. Connect the ends of T1 with those of T2 thus forming
    a new loop.

FIGURE 7.  Interconnection of Processors for Post Processing
Let r represent the amount of time required to route a single record from
one processor to another. Let c denote the time required to compare (and
interchange, if necessary) two records by the same processor. There are P
processors and M records per processor.

The amount of time required by each processor to do localized sorting
in the beginning (i.e., during Step 0) is given by

    (M log M)c

where we assume that M records can be sorted by a processor by using M log M
comparison (and interchange) operations.

During all other steps (and there are log P other steps), localized
sorting can be done by simply merging the records from the two ends. Every such
sorting step requires M comparison (and interchange) operations. Hence, the
total time for localized sorting after Step 0 is given by

    (M log P)c

A merge operation between two processors requires 2M record routing
operations and M comparisons (and interchanges, if necessary). The 'merge'
command is executed once in Step 1, twice in Step 2, etc. and log P times in
the final step. Hence the total time for merging is

    (2Mr + Mc)(1 + 2 + ... + log P)
    = (2Mr + Mc)(1 + log P) log P / 2

Thus, the total time for sorting is

    Mc(log M + log P) + (2Mr + Mc)(1 + log P) log P / 2

Hence, the order of time is given by

    O(M log M + M log^2 P)
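The three contributions to the sorting time can be evaluated numerically with a short Python sketch. The function names are hypothetical, the formula is the one derived above, and the numbers in the demo are arbitrary illustrative inputs rather than measurements.

```python
import math


def merge_passes(P):
    """Number of 'merge' commands: 1 + 2 + ... + log P."""
    log_p = int(math.log2(P))
    return sum(range(1, log_p + 1))


def sorting_time(M, P, r, c):
    """Total time Mc(log M + log P) + (2Mr + Mc)(1 + log P) log P / 2,
    with logarithms to the base 2."""
    log_p = math.log2(P)
    step0_sort = M * math.log2(M) * c            # localized sort in Step 0
    later_sorts = M * log_p * c                  # localized sorts in Steps 1..log P
    merging = (2 * M * r + M * c) * (1 + log_p) * log_p / 2
    return step0_sort + later_sorts + merging


if __name__ == "__main__":
    print(merge_passes(16))                      # merge commands for P = 16
    print(sorting_time(M=1024, P=16, r=1.0, c=1.0))
```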
7.  PROCESSOR UTILIZATION


It  may  be  expected  that,  in  any  given  amount  of  time,  P processors  should 
perform P times as much work as can a single processor working alone.
Unfortunately, due to the simultaneous need for the same resource by a number of
processors, this expectation can rarely be materialized. It may be of interest,
therefore, to quantify the utilization of processors, in terms of an efficiency
measure. We define the efficiency of a processor (in the execution of a given
task) as the ratio between the share H of the work actually performed by the
processor in unit time and the work W it would have performed in unit time if
it were acting alone.

In  the  context  of  sorting  algorithms,  let  us  say  that  a single  processor, 
acting  alone,  can  sort  MP  elements  in  bMP  log(MP)  units  of  time,  where  b is  a 
constant. In that case, the processor, acting alone, can perform 1/(bMP log(MP))
amount of work in one unit of time. Therefore, if there are P processors
working in parallel to do the same job, then the maximum amount of work any one
of these P processors can be expected to perform in one unit of time is

    W = 1/(bMP log(MP))

With respect to our parallel sorting algorithm, we have noted earlier
that the total time for sorting MP elements with the help of P parallel
processors is given by

    cM log(MP) + dM log^2 P

where c and d are constants. The reciprocal of this number is the amount
of work performed by all the P processors together in one unit of time. Thus,
the amount of work performed by any one of the P processors in unit time is
given by

    H = (1/P) (1/(cM log(MP) + dM log^2 P))
Hence, the efficiency E of a processor in the parallel sorting environment
is

    E = H/W
      = (b log(MP))/(c log(MP) + d log^2 P)

It  may  easily  be  observed  that  for  a given  P,  E increases  with  M.  If  the 
constants  b and  c are  equal,  then  the  efficiency  E approaches  unity  as  M 
becomes  larger  and  larger.  This  is  one  of  the  major  advantages  of  our  method 
over  other  parallel  sorting  methods.  While  the  best  previously  known 
parallel  sorting  method  can  achieve  a processor  efficiency  of  at  most 
b/(c  + d log  P)  (by  substituting  1 for  M in  the  above  efficiency  formula), 

our  method  achieves  a much  better  processor  efficiency  and  utilization,  since 
M can  take  up  values  greater  than  1. 
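The behaviour of E can be illustrated numerically. In the Python sketch below the function name efficiency is hypothetical and the constants b, c and d are set to 1 purely for illustration; they are not values taken from the report.

```python
import math


def efficiency(M, P, b=1.0, c=1.0, d=1.0):
    """E = b log(MP) / (c log(MP) + d log^2 P), logarithms to the base 2."""
    log_mp = math.log2(M * P)
    log_p = math.log2(P)
    return b * log_mp / (c * log_mp + d * log_p ** 2)


if __name__ == "__main__":
    for M in (1, 16, 1024, 2 ** 20):
        print(M, round(efficiency(M, 16), 3))
```

For M = 1 the expression reduces to b/(c + d log P), and for fixed P the printed values rise toward b/c as M grows.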

8.  CONCLUSIONS 

We have shown that by using P processors with sequential memory and log P
interconnections among processors, it is possible to sort MP records in
O(M log M + M log^2 P) time. It may be noted that P is a power of 2, but M, the number of
records  assigned  to  each  processor,  can  be  any  positive  integer.  The  number  of 
records  that  can  be  sorted  in  a batch  is  restricted  only  by  the  memory  size  of 
each  processor  and  not  by  the  number  of  processors.  In  fact,  an  efficiency 
measure  is  introduced  which  shows  that  processor  utilization  increases  with 
increasing  values  of  M. 